This dataset consists of time series data covering almost two years of indoor air quality measurements in an office space in a building of the University of Ljubljana (UL) in Slovenia, Europe.
Choosing different colors to represent:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style(style='whitegrid',
              rc={'figure.figsize': (11.7, 8.27), 'xtick.major.size': 12, 'ytick.major.size': 12,
                  'font.size': 14, 'axes.titlesize': 14, 'axes.labelsize': 14})
%matplotlib inline
# Colors for plots
current_pallete = sns.color_palette()
pallete = sns.color_palette().as_hex()
# Hex codes for BASELINE and MOBISTYLE colors
color_BL = '#1f77b4'
color_MS = '#ff7f0e'
color_other = '#2ca02c'
color_out = '#7f7f7f'
color_missing = 'lightgrey'
# Create dictionary palette for Monitoring period
pal = dict(MOBISTYLE=color_MS, BASELINE=color_BL)
# Plot Colors for BASELINE and MOBISTYLE
sns.palplot([color_BL, color_MS])
# Set bins and labels for Indoor environmental parameters
# The bin ranges are based on the European building design standard DS/EN 15251:2007
bins_TEMP, bins_RH = [-10000, 19, 20, 21, 23, 24, 25, 10000], [-10000, 20, 25, 30, 50, 60, 70, 10000]
bins_CO2, bins_VOC = [-10000, 750, 900, 1200, 10000], [-10000, 40, 80, 100, 10000]
labels_T_RH = ['Cat -IV','Cat -III', 'Cat -II','Cat I', 'Cat +II','Cat +III','Cat +IV']
labels_CO2_VOC = ['Cat I', 'Cat II', 'Cat III', 'Cat IV']
# RGB codes for Comfort category colors
cmap_T_RH = [(0, .33, .82), (0, .7, .82), (.5, .95, .75), (.3, .7, .4), (.6, .8, .4), (.95, .39, .4), (.8, .07, .25), 'lightgrey']
cmap_CO2_VOC = [(.3, .7, .4), (.6, .8, .4), (.95, .39, .4), (.8, .07, .25), 'lightgrey']
# Plot diverging color palette for Temperature and relative humidity categories
# Grey color for the missing data
sns.palplot(cmap_T_RH)
# Plot color palette for CO2 and VOC level categories
# Grey color for the missing data
sns.palplot(cmap_CO2_VOC)
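The bins and labels defined above can be turned into comfort categories with `pd.cut`. The following is only a minimal sketch: the temperatures are made up, and it assumes that each interval is closed on the right (the `pd.cut` default), which may differ from how the categories in the dataset were actually assigned.

```python
import pandas as pd

# Bin edges and labels copied from the plotting setup above
bins_TEMP = [-10000, 19, 20, 21, 23, 24, 25, 10000]
labels_T_RH = ['Cat -IV', 'Cat -III', 'Cat -II', 'Cat I', 'Cat +II', 'Cat +III', 'Cat +IV']

# Made-up office temperatures (degrees C), for illustration only
sample_temp = pd.Series([18.5, 22.0, 26.3], name='Temperature')

# Assumption: intervals are closed on the right, e.g. (21, 23] is labeled 'Cat I'
sample_cat = pd.cut(sample_temp, bins=bins_TEMP, labels=labels_T_RH)
print(sample_cat.tolist())
```

Values below 19 °C or above 25 °C fall into the outermost categories because the first and last bin edges are set far outside any plausible reading.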
Before the exploratory data analysis, preliminary data wrangling was performed; see 'data_exploration.ipynb' for further details.
In this chapter, the dataset is loaded and its variables are described in order to motivate the data exploration goals.
# Assign categorical data types before reading CSV file
dtypes = {
'Monitoring_Period': 'category',
'Season': 'category',
'Category_TEMP': 'category',
'Category_RH': 'category',
'Category_CO2': 'category',
'Category_VOC': 'category',
}
# read CSV file
df = (pd.read_csv('./Files/office_air_quality_data.csv',
parse_dates=True,
dtype=dtypes,
index_col='Timestamp')
)
The dataset contains indoor and outdoor air parameter measurements, the window opening state, and data on whether the employees are present in the office, temperature setpoint regulation, and heating and cooling valve operation.
The time series spans almost two years, with observations at 15-minute intervals.
The data can be further divided into two monitoring periods, denominated as:
'BASELINE' - First year of the monitoring, used to get reference data on the specific office space (with no interventions).
'MOBISTYLE' - Second year of the monitoring, when employees are given a mobile app and sensors with LED lights (signaling bad air quality in the office room) are installed.
fig, ax = plt.subplots(figsize=(11.7, 4), sharex=True)
df.loc[:, ['Temperature', 'Outdoor_Temperature']].plot(title='Office and Outdoor Air Temperature', color=['darkblue', 'grey'],
linewidth=.4, ax=ax)
ax.axvline(x='02-01-2019', color='orange', linestyle='--', linewidth=1)
ax.fill_between(df.loc['02-01-2019':, :].index.values, -15, 40, facecolor=color_MS, alpha=0.1)
ax.fill_between(df.loc[:'02-01-2019', :].index.values, -15, 40, facecolor=color_BL, alpha=0.1)
ax.text(x='02-01-2019', y=-14, s='LED, Mobile App', color='red', size=12);
ax.text(x='02-01-2019', y=37, s='MOBISTYLE', color=color_MS, size=12);
ax.text(x='02-01-2018', y=37, s='BASELINE', color=color_BL, size=12);
ax.set(ylim=(-15,40), xticks=pd.date_range(start='2018-02-1', periods=12, freq='2MS'), xlabel='', ylabel='($^o$C)');
ax.set_xticklabels(pd.date_range(start='2018-02-1', periods=12, freq='2MS').strftime('%b, %Y'), rotation=90);
Continuous variables
Indoor air parameters like Temperature, Relative Humidity (RH), Carbon Dioxide (CO2) and Volatile Organic Compounds (VOC) levels
Outdoor air parameters like Temperature, RH, Global radiation, Diffuse radiation
Categorical variables
Season: Heating or Cooling season of the year
Monitoring period: BASELINE and MOBISTYLE
Room status: 0 - unoccupied room, 1 - occupied room
Window state: 0 - closed window, 1 - open window
Window state change: 1 - opening of the window, 0 - no change, -1 - closing of the window
Comfort categories: Indoor Air Quality (IAQ) levels of Temperature, RH, CO2 and VOC are labeled within specific parameter ranges based on the European building design standard DS/EN 15251:2007 (see the Plotting setup section above for more details)
# Variable dtypes
# Note: Due to np.nan values, Room status, Window state and Window state change are represented as float type
df.dtypes
The alternative hypothesis for this project is that the employees' use of the mobile app and the installation of sensors with LED lights (that signal high CO2 levels) would improve the air quality in the office, and would also prompt the employees to open the windows more often and keep them open for longer than before.
The main interest of this dataset is to investigate whether these effects can be observed during the second monitoring period, denominated as 'MOBISTYLE'. The exploration goals are:
Compare the IAQ parameter distributions between both monitoring periods.
Estimate the correlation of the indoor air parameters with the outdoor climate measurements. It is important to compare the outdoor climate between 'MOBISTYLE' and 'BASELINE', because one measurement year could have been warmer or colder than the other and thus affect the comparison.
Take into account the seasonality of the monitoring data by dividing the data into Heating and Cooling seasons.
The dataset contains 15-minute values, including weekends and night time. Therefore, before conducting the analysis, the dataset must be filtered to the times when the employees are present in the office, in order to account properly for the indoor climate parameters during working hours.
# Filter the data, exclude the time outside working hours
data = df.query('Room_Status == 1')
# Verify that all values are equal to 1 (Office room occupied)
assert (data['Room_Status'] == 1).all()
# Dataset was reduced by almost 2/3 after filtering
data.info()
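To illustrate how filtering on occupancy shrinks a 15-minute dataset, here is a small self-contained sketch. The frame `toy`, its values, and the 08:00 to 16:00 occupancy window are all assumptions made for the example and are not taken from the office data.

```python
import numpy as np
import pandas as pd

# Hypothetical 15-minute readings covering one full day (96 intervals)
idx = pd.date_range('2018-03-01', periods=96, freq='15min')
toy = pd.DataFrame({'CO2': np.linspace(400, 900, 96)}, index=idx)

# Assumption: the room is occupied between 08:00 and 16:00
toy['Room_Status'] = ((idx.hour >= 8) & (idx.hour < 16)).astype(float)

# Keep only the occupied intervals, as in the filtering step above
occupied = toy.query('Room_Status == 1')

# Every remaining row must be an occupied interval
assert (occupied['Room_Status'] == 1).all()
share_kept = len(occupied) / len(toy)
print(f'{share_kept:.0%} of the rows remain after filtering')
```

With an eight-hour working day, one third of the rows remain, which is in line with the reduction by almost 2/3 noted for the real dataset.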
The main purpose of this project is to determine whether there are any air quality improvements during the MOBISTYLE monitoring period.
It may be hard to evaluate the differences by comparing raw 15-minute values, for example by plotting yearly temperature changes, because the data fluctuates a lot; the dataset must therefore be compared at larger time scales.
The further investigation of the air quality parameters is performed by increasing the time step from the original 15 minutes to monthly, then to a seasonal (heating, cooling) comparison, and finally the whole monitoring periods, MOBISTYLE vs BASELINE, are compared.
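Increasing the time step can be sketched with `resample`. The series below is synthetic (constant 21 °C in January and 23 °C in February) and is only meant to show the mechanics of aggregating 15-minute values to daily and monthly means.

```python
import numpy as np
import pandas as pd

# Hypothetical 15-minute temperature readings for January and February 2018
idx = pd.date_range('2018-01-01', '2018-02-28 23:45', freq='15min')
temp = pd.Series(np.where(idx.month == 1, 21.0, 23.0), index=idx, name='Temperature')

# Aggregate to coarser time steps: daily, then monthly ('MS' = month start)
daily_mean = temp.resample('D').mean()
monthly_mean = temp.resample('MS').mean()
print(monthly_mean)
```

The same pattern applies to any of the continuous variables; only the resampling rule changes with the desired time scale.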
fig, axes = plt.subplots(2,4, figsize=(11.7, 4), sharey=True)
air_cols = ['Temperature', 'RH', 'CO2', 'VOC', 'Outdoor_Temperature',
'Outdoor_RH', 'Global_radiation', 'Diffuse_radiation']
kws = dict(kde=False, hist=True, kde_kws={'shade': True})
for col, ax in zip(air_cols, axes.flat):
    sns.distplot(data[col], **kws, color=color_BL, ax=ax);
plt.tight_layout()
fig, axs = plt.subplots(2,4, figsize=(11.7, 4), sharey=True)
(ax1, ax2, ax3, ax4), (ax5, ax6, ax7, ax8) = axs
kws = dict(kde=False, hist=True, kde_kws={'shade': True})
for period, color in pal.items():
    df_plot = data.query(f'Monitoring_Period == "{period}"')
    # Indoor
    sns.distplot(df_plot['Temperature'], **kws, color=color, ax=ax1);
    sns.distplot(df_plot['RH'], **kws, color=color, ax=ax2);
    sns.distplot(df_plot['CO2'], **kws, color=color, ax=ax3);
    sns.distplot(df_plot['VOC'], **kws, color=color, ax=ax4);
    # Outdoor
    sns.distplot(df_plot['Outdoor_Temperature'], **kws, color=color, ax=ax5);
    sns.distplot(df_plot['Outdoor_RH'], **kws, color=color, ax=ax6);
    sns.distplot(df_plot['Global_radiation'], **kws, color=color, ax=ax7);
    sns.distplot(df_plot['Diffuse_radiation'], **kws, color=color, ax=ax8);
plt.tight_layout()
Take a closer look at the Temperature and RH parameters, as they have two peaks, indicating seasonal differences.
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(11.7, 4), sharey=True)
kws = dict(kde=True, hist=False, kde_kws={'shade': True})
for period, color in pal.items():
    df_plot = data.query(f'Monitoring_Period == "{period}"')
    sns.distplot(df_plot['Temperature'], **kws, color=color, bins=np.arange(16, 30, .5), ax=ax1);
    sns.distplot(df_plot['RH'], **kws, color=color, bins=np.arange(0, 100, 5), ax=ax2);
plt.legend(pal.keys())
plt.tight_layout()
From kernel density plots and histograms two clear peaks for the office temperature can be observed. This indicates clear seasonal differences for parameters like Temperature and also to some extent for the RH. CO2 and VOC levels don't have such distributions.
At the first level, the data is grouped by the Monitoring period:
'MOBISTYLE'
'BASELINE'
Afterwards it is further grouped into summer and winter periods, or as denoted here:
'HEATING'
'COOLING'
This is important because air parameters like temperature and relative humidity depend on the season of the year, as could be seen in the previous comparison.
Monthly parameter distributions are also created to look closer at the summer months.
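The two-level grouping described above can be sketched with `groupby`. The frame `toy` and its temperature values are invented for illustration; only the column names follow the dataset.

```python
import pandas as pd

# Made-up occupied-hours readings; values are for illustration only
toy = pd.DataFrame({
    'Monitoring_Period': ['BASELINE', 'BASELINE', 'BASELINE',
                          'MOBISTYLE', 'MOBISTYLE', 'MOBISTYLE'],
    'Season': ['HEATING', 'HEATING', 'COOLING', 'HEATING', 'COOLING', 'COOLING'],
    'Temperature': [21.0, 22.0, 26.0, 21.5, 25.0, 27.0],
})

# First level: monitoring period; second level: season
season_means = toy.groupby(['Monitoring_Period', 'Season'])['Temperature'].mean()
print(season_means)
```

Comparing the same season across the two periods avoids mixing the seasonal effect with the effect of the intervention.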
First, investigate monthly outdoor climate differences.
fig, axes = plt.subplots(4,1, figsize=(11.7, 12), sharex=True)
for col, ax in zip(air_cols[4:], axes.flat):
    sns.boxplot(data=data, x=data.index.month, y=col, hue='Monitoring_Period',
                showfliers=False, palette=[color_BL, color_MS], ax=ax)
    ax.set_xlabel('', fontsize=14)
    ax.set_xticklabels(pd.date_range(start='2018-1-1', periods=12, freq='MS').strftime('%b'), rotation=0)
    ax.tick_params(axis='both', which='major', labelsize=14)
    ax.legend(fontsize=14)
    ax.xaxis.grid(True)
axes[0].set_title('Outdoor climate. Slovenia, Ljubljana (Bežigrad)', fontsize=14)
axes[0].set_ylabel('Outdoor Temperature ($^o$C)', fontsize=14)
axes[1].set_ylabel('Outdoor RH (%)', fontsize=14)
axes[2].set_ylabel('Global Radiation (W/m2)', fontsize=14)
axes[3].set_ylabel('Diffuse Radiation (W/m2)', fontsize=14)
plt.tight_layout()
fig, axes = plt.subplots(4,1, figsize=(11.7, 12), sharex=True)
for col, ax in zip(air_cols[:4], axes.flat):
    sns.boxplot(data=data, x=data.index.month, y=col, hue='Monitoring_Period',
                showfliers=False, palette=[color_BL, color_MS], ax=ax)
    ax.set_xlabel('', fontsize=14)
    ax.set_xticklabels(pd.date_range(start='2018-1-1', periods=12, freq='MS').strftime('%b'), rotation=0)
    ax.tick_params(axis='both', which='major', labelsize=14)
    ax.legend(fontsize=14)
    ax.xaxis.grid(True)
axes[0].set_title('Office indoor climate. University of Ljubljana, Slovenia', fontsize=14)
axes[0].set_ylabel('Temperature ($^o$C)', fontsize=14)
axes[1].set_ylabel('RH (%)', fontsize=14)
axes[2].set_ylabel('CO2 (ppm)', fontsize=14)
axes[3].set_ylabel('VOC (ppb)', fontsize=14)
plt.tight_layout()
Observations for IAQ parameter differences:
# Overlay histograms and kernel density estimates of both monitoring periods for each season
g = sns.FacetGrid(data, col='Season', hue='Monitoring_Period',
height=3.5, aspect=1.5, palette=pal)
g = (g.map(sns.distplot, 'Temperature', bins=np.arange(16, 30, .5),
kde=True, hist=True, kde_kws={'shade': True})
.add_legend()
.set_xlabels('Office Temperature ($^o$C)'))
Investigate the following window data:
fig, (ax1, ax2) = plt.subplots(2,1, figsize=(11.7, 8), sharex=True)
# Resample monthly window opening count and average time open
df_OP = df[['Window_State_Change']].resample('MS').apply(lambda x: x.isin([1]).sum())
df_OP['Window_State'] = df[['Window_State']].resample('MS').mean() * 100
df_OP.loc[:'2019-02-1', 'Monitoring_Period'] = 'BASELINE'
df_OP.loc['2019-02-1': , 'Monitoring_Period'] = 'MOBISTYLE'
# Plot
sns.barplot(data=df_OP, x=df_OP.index.month, y='Window_State_Change', hue='Monitoring_Period',
palette=pal, ax=ax1);
sns.barplot(data=df_OP, x=df_OP.index.month, y='Window_State', hue='Monitoring_Period',
palette=pal, ax=ax2);
# Format axes
ax1.set(ylabel='Opening (count)', xlabel='')
ax2.set(ylabel='Time open (pct)', xlabel='')
ax2.set_xticklabels(pd.date_range(start='2018-1-1', periods=12, freq='MS').strftime('%b'), rotation=0)
for ax in (ax1, ax2):
    ax.legend(loc='upper right', fontsize=14)
plt.tight_layout()
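The relationship between the window state and the window state change can be sketched as follows. This is an assumption about how such a column could be derived (as the first difference of the state), not a description of how the dataset's `Window_State_Change` column was actually produced; the series is made up.

```python
import pandas as pd

# Hypothetical 15-minute window state: 0 = closed, 1 = open
idx = pd.date_range('2018-06-01 08:00', periods=8, freq='15min')
window_state = pd.Series([0, 1, 1, 0, 0, 1, 0, 0], index=idx, dtype=float)

# Assumption: the state change is the first difference of the state
# (+1 = opening, -1 = closing, 0 = no change); the first value has no predecessor
state_change = window_state.diff().fillna(0)

openings = int((state_change == 1).sum())   # number of opening events
time_open_pct = window_state.mean() * 100   # share of intervals with the window open
print(openings, time_open_pct)
```

Counting the `+1` values gives the number of openings, while the mean of the 0/1 state gives the share of time the window was open, which are the two quantities plotted per month above.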
g = sns.catplot(x='Category_TEMP', hue='Monitoring_Period', col='Season',
data=data, kind='count', order=labels_T_RH,
height=3.5, aspect=1.5);
# Create a function to count categories, return a DataFrame
def stats_category(df, cat_name):
    """
    Calculate the share of time (%) that an indoor air parameter spends in each
    comfort category, plus the percentage of missing data.
    """
    # Temperature and RH use the diverging labels, CO2 and VOC the one-sided labels
    if cat_name in ('Category_TEMP', 'Category_RH'):
        labels = labels_T_RH
    else:
        labels = labels_CO2_VOC
    df_cat = pd.DataFrame(0.0, index=[cat_name], columns=labels + ['Missing data'])
    for cat in labels:
        df_cat.loc[cat_name, cat] = df[cat_name].isin([cat]).sum() * 100 / len(df[cat_name])
    df_cat.loc[cat_name, 'Missing data'] = df[cat_name].isna().sum() * 100 / len(df[cat_name])
    return df_cat.fillna(0)
stats_category(data, 'Category_TEMP')
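The same kind of time distribution can also be obtained with `value_counts`. The sketch below is an alternative illustration on made-up categories, not the function used in this notebook; `cats` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up CO2 comfort categories, including one missing reading
labels_CO2_VOC = ['Cat I', 'Cat II', 'Cat III', 'Cat IV']
cats = pd.Series(['Cat I', 'Cat I', 'Cat II', np.nan, 'Cat I'], name='Category_CO2')

# Share of time (%) per category; dropna=False keeps missing readings in the denominator
share = cats.value_counts(normalize=True, dropna=False) * 100

# Report the categories in their natural order, filling absent ones with 0
share_by_cat = share.reindex(labels_CO2_VOC, fill_value=0.0)
missing_pct = cats.isna().mean() * 100
print(share_by_cat.round(1).to_dict(), round(missing_pct, 1))
```

Keeping the missing readings in the denominator means the category shares and the missing share together add up to 100 %, which is what the stacked bar charts below rely on.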
# Function to plot text in the middle of horizontal bar charts
# Input params: DataFrame and Axes object
def set_barh_text(df, ax):
    for row_num, row in enumerate(df.fillna(0.).values):
        xpos = 0
        for val in row:
            xpos += val
            # Label only segments wider than 1 %, centered within the segment
            label = f'{int(round(val))}' if val > 1. else ''
            ax.text(xpos - val / 2, row_num, label, color='white',
                    ha='center', va='center', fontsize=10)
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(11.7, 4), sharey=True)
# Group comfort category data for each period and season
df_cat_T = (data
.groupby(['Monitoring_Period', 'Season']).apply(stats_category, 'Category_TEMP')
.reset_index().set_index('Season')
)
BL_T = df_cat_T.query('Monitoring_Period == "BASELINE"').loc[:, labels_T_RH + ['Missing data']]
MS_T = df_cat_T.query('Monitoring_Period == "MOBISTYLE"').loc[:, labels_T_RH + ['Missing data']]
# Plot
BL_T.plot(kind='barh', stacked=True, color=cmap_T_RH, legend='', ax=ax1)
MS_T.plot(kind='barh', stacked=True, color=cmap_T_RH, legend='', ax=ax2)
# Add percentage text to bars
set_barh_text(BL_T, ax1)
set_barh_text(MS_T, ax2)
# Format Axes and Figure
for ax in (ax1, ax2):
    ax.set(xlim=(0, 100), xticks=[0, 20, 40, 60, 80, 100],
           xticklabels=[0, 20, 40, 60, 80, '100%'], ylabel='')
    ax.tick_params(axis='both', which='major', labelsize=14)
    ax.legend('')
ax1.set_title('BASELINE')
ax2.set_title('MOBISTYLE')
plt.suptitle(f'Time Distribution (%) in Comfort Categories. Temperature', fontsize=14)
# Add legend
fig.legend(labels_T_RH + ['Missing data'], loc='lower center', bbox_to_anchor=(0.5, 0.0), ncol=8, fontsize=12)
fig.tight_layout()
fig.subplots_adjust(top=0.85, bottom=0.2)
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(11.7, 4), sharey=True)
# Group comfort category data for each period and season
df_cat_T = (data
.groupby(['Monitoring_Period', 'Season']).apply(stats_category, 'Category_RH')
.reset_index().set_index('Season')
)
BL_T = df_cat_T.query('Monitoring_Period == "BASELINE"').loc[:, labels_T_RH + ['Missing data']]
MS_T = df_cat_T.query('Monitoring_Period == "MOBISTYLE"').loc[:, labels_T_RH + ['Missing data']]
# Plot
BL_T.plot(kind='barh', stacked=True, color=cmap_T_RH, legend='', ax=ax1)
MS_T.plot(kind='barh', stacked=True, color=cmap_T_RH, legend='', ax=ax2)
# Add percentage text to bars
set_barh_text(BL_T, ax1)
set_barh_text(MS_T, ax2)
# Format Axes and Figure
for ax in (ax1, ax2):
    ax.set(xlim=(0, 100), xticks=[0, 20, 40, 60, 80, 100],
           xticklabels=[0, 20, 40, 60, 80, '100%'], ylabel='')
    ax.tick_params(axis='both', which='major', labelsize=14)
    ax.legend('')
ax1.set_title('BASELINE')
ax2.set_title('MOBISTYLE')
plt.suptitle(f'Time Distribution (%) in Comfort Categories. RH', fontsize=14)
# Add legend
fig.legend(labels_T_RH + ['Missing data'], loc='lower center', bbox_to_anchor=(0.5, 0.0), ncol=8, fontsize=12)
fig.tight_layout()
fig.subplots_adjust(top=0.85, bottom=0.2)
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(11.7, 4), sharey=True)
# Group comfort category data for each period and season
df_cat_T = (data
.groupby(['Monitoring_Period', 'Season']).apply(stats_category, 'Category_CO2')
.reset_index().set_index('Season')
)
BL_T = df_cat_T.query('Monitoring_Period == "BASELINE"').loc[:, labels_CO2_VOC+ ['Missing data']]
MS_T = df_cat_T.query('Monitoring_Period == "MOBISTYLE"').loc[:, labels_CO2_VOC + ['Missing data']]
# Plot
BL_T.plot(kind='barh', stacked=True, color=cmap_CO2_VOC, legend='', ax=ax1)
MS_T.plot(kind='barh', stacked=True, color=cmap_CO2_VOC, legend='', ax=ax2)
# Add percentage text to bars
set_barh_text(BL_T, ax1)
set_barh_text(MS_T, ax2)
# Format Axes and Figure
for ax in (ax1, ax2):
    ax.set(xlim=(0, 100), xticks=[0, 20, 40, 60, 80, 100],
           xticklabels=[0, 20, 40, 60, 80, '100%'], ylabel='')
    ax.tick_params(axis='both', which='major', labelsize=14)
    ax.legend('')
ax1.set_title('BASELINE')
ax2.set_title('MOBISTYLE')
plt.suptitle(f'Time Distribution (%) in Comfort Categories. CO2', fontsize=14)
# Add legend
fig.legend(labels_CO2_VOC + ['Missing data'], loc='lower center', bbox_to_anchor=(0.5, 0.0), ncol=8, fontsize=12)
fig.tight_layout()
fig.subplots_adjust(top=0.85, bottom=0.2)
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(11.7, 4), sharey=True)
# Group comfort category data for each period and season
df_cat_T = (data
.groupby(['Monitoring_Period', 'Season']).apply(stats_category, 'Category_VOC')
.reset_index().set_index('Season')
)
BL_T = df_cat_T.query('Monitoring_Period == "BASELINE"').loc[:, labels_CO2_VOC+ ['Missing data']]
MS_T = df_cat_T.query('Monitoring_Period == "MOBISTYLE"').loc[:, labels_CO2_VOC + ['Missing data']]
# Plot
BL_T.plot(kind='barh', stacked=True, color=cmap_CO2_VOC, legend='', ax=ax1)
MS_T.plot(kind='barh', stacked=True, color=cmap_CO2_VOC, legend='', ax=ax2)
# Add percentage text to bars
set_barh_text(BL_T, ax1)
set_barh_text(MS_T, ax2)
# Format Axes and Figure
for ax in (ax1, ax2):
    ax.set(xlim=(0, 100), xticks=[0, 20, 40, 60, 80, 100],
           xticklabels=[0, 20, 40, 60, 80, '100%'], ylabel='')
    ax.tick_params(axis='both', which='major', labelsize=14)
    ax.legend('')
ax1.set_title('BASELINE')
ax2.set_title('MOBISTYLE')
plt.suptitle(f'Time Distribution (%) in Comfort Categories. VOC', fontsize=14)
# Add legend
fig.legend(labels_CO2_VOC + ['Missing data'], loc='lower center', bbox_to_anchor=(0.5, 0.0), ncol=8, fontsize=12)
fig.tight_layout()
fig.subplots_adjust(top=0.85, bottom=0.2)
From the univariate distributions it could be seen that some IAQ parameters have distributions with two peaks, indicating that the data differs either between seasons (HEATING, COOLING) or between monitoring periods (BASELINE, MOBISTYLE). Therefore, the following bivariate exploration groups the data by season and by monitoring period.
From the pairplot figure below it can be seen that some air parameters have a clear seasonal distribution (Temperature, RH).
Furthermore, it can be seen that CO2 levels are strongly correlated with VOC levels, and the points fall almost on a straight line.
sns.pairplot(data.iloc[:, 3:], hue='Season');
When grouping by monitoring period, the distributions of the IAQ parameters in the figure below are more similar to each other than when grouping by Season previously.
Office RH shows the largest distribution differences.
Also, CO2 and VOC levels are clearly divided by monitoring period.
sns.pairplot(data.iloc[:, 3:-1], hue='Monitoring_Period');
As could be observed from the previous bivariate plots, indoor and outdoor temperatures are strongly correlated.
From the images below, the seasonal difference can be seen.
However, it is hard to clearly see the differences between monitoring periods in this graph. The office temperature seems to be in the same range for both periods.
# Function to plot Hexbin using FacetGrid
def hexbin(x, y, color, **kwargs):
    cmap = sns.light_palette(color, as_cmap=True)
    plt.hexbin(x, y, gridsize=20, cmap=cmap, **kwargs)
# Plot by monitoring period
with sns.axes_style("dark"):
    g = sns.FacetGrid(data, hue='Monitoring_Period', col='Monitoring_Period', height=4)
    g = (g.map(hexbin, 'Temperature', 'Outdoor_Temperature', extent=[15, 30, -5, 35])
         .set_axis_labels('Office Temperature ($^o$C)', 'Outdoor Temperature ($^o$C)'));
# Plot by Monitoring period and Season
with sns.axes_style("dark"):
    g = sns.FacetGrid(data, hue='Monitoring_Period', col='Monitoring_Period', row='Season', height=4)
    g = (g.map(hexbin, 'Temperature', 'Outdoor_Temperature', extent=[15, 30, -5, 35])
         .set_axis_labels('Office Temperature ($^o$C)', 'Outdoor Temperature ($^o$C)'));
# Plot by Monitoring period and Season
with sns.axes_style("dark"):
    g = sns.FacetGrid(data, hue='Monitoring_Period', col='Monitoring_Period', row='Season', height=4)
    g = (g.map(hexbin, 'RH', 'Outdoor_RH', extent=[0, 100, 0, 100])
         .set_axis_labels('Office RH (%)', 'Outdoor RH (%)'));
After plotting pair plots and grouping the variables, let's calculate Spearman's correlation coefficient between all continuous numeric variables.
It is important to note that in the previous plots the window data could not be evaluated, because the 'Window_State' column shows only the state at each timestamp (0, 1). Therefore, daily average values must be calculated in order to estimate how long the window is kept open each day.
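Why a daily mean of the window state is meaningful can be shown on a small synthetic series. The opening times below are invented; the point is only that the mean of a 0/1 series equals the fraction of the day the window was open.

```python
import pandas as pd

# Hypothetical window state over two days at 15-minute resolution (0 = closed, 1 = open)
idx = pd.date_range('2018-06-01', periods=192, freq='15min')
window_state = pd.Series(0.0, index=idx, name='Window_State')
window_state.loc['2018-06-01 09:00':'2018-06-01 14:45'] = 1   # open for 6 hours on day 1
window_state.loc['2018-06-02 10:00':'2018-06-02 11:45'] = 1   # open for 2 hours on day 2

# The daily mean of a 0/1 series is the fraction of the day the window was open
daily_open_pct = window_state.resample('D').mean() * 100
print(daily_open_pct)
```

Six hours out of 24 give 25 % for the first day, so the daily value can be read directly as "percentage of time open". When the data is first filtered to occupied hours, as in this notebook, the percentage refers to the occupied time only.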
# Select numeric variables, calculate daily mean values and include window state column
df_daily = data.iloc[:, np.r_[1, 3:9, 13:17]].resample('D').mean().dropna(how='all')
df_daily.loc[:'2019-02-1', 'Monitoring_Period'] = 'BASELINE'
df_daily.loc['2019-02-1': , 'Monitoring_Period'] = 'MOBISTYLE'
# Correlation matrix for daily mean values (numeric columns only)
df_daily.corr(numeric_only=True)
# Plot only the lower triangle of the matrix (the upper triangle is masked)
fig, ax = plt.subplots(figsize=(11.7, 8.27))
corr = df_daily.corr(method='spearman', numeric_only=True)
mask = np.triu(np.ones_like(corr, dtype=bool))
# Format labels
labels = [col.replace('_', ' ') for col in corr.columns]
hm = sns.heatmap(corr, square=True,
                 cmap='RdBu_r', linewidths=.5, annot=True, fmt='.2f',
                 xticklabels=labels, yticklabels=labels,
                 mask=mask, vmax=1., vmin=-1., ax=ax);
hm.figure.axes[-1].set_ylabel('Correlation coefficient', size=16)
hm.tick_params(labelsize=14)
hm.set_title("Spearman's correlation matrix. Daily average values", fontsize=16)
plt.tight_layout()
As expected, the office air temperature has a strong correlation with the outdoor air temperature.
Furthermore, the window state (the percentage of time when the window is open) is also correlated with the outdoor air temperature, thus indicating seasonality. Therefore, in order to compare both monitoring periods, the data must be compared for the Heating and Cooling seasons separately.
The correlation matrix shows that CO2 levels are strongly correlated with VOC levels, which suggests that both parameters are related to the indoor air pollution caused by the employees.
Global radiation has a strong negative correlation with Outdoor RH.
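Spearman's coefficient is used here because it measures monotonic rather than strictly linear association: it is Pearson's coefficient computed on the ranks of the values. The sketch below demonstrates this on made-up daily means; the numbers are not taken from the dataset.

```python
import pandas as pd

# Made-up daily means: a monotonic but non-linear relationship
toy = pd.DataFrame({
    'Outdoor_Temperature': [-2.0, 3.0, 8.0, 15.0, 22.0, 30.0],
    'Window_State': [0.00, 0.01, 0.02, 0.10, 0.40, 0.90],
})

cols = ('Outdoor_Temperature', 'Window_State')

# Spearman's coefficient equals Pearson's coefficient computed on the ranks
spearman = toy.corr(method='spearman').loc[cols]
pearson_on_ranks = toy.rank().corr(method='pearson').loc[cols]

# Pearson on the raw values is lower because the relationship is not linear
pearson = toy.corr(method='pearson').loc[cols]
print(spearman, pearson_on_ranks, pearson)
```

This matters for variables like window opening time, which can rise steeply only on warm days and would therefore be understated by a linear coefficient.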
# Group by Monitoring period and season, resample and aggregate daily mean values
data_daily = (data
              .groupby(['Monitoring_Period', 'Season']).resample('D').mean(numeric_only=True)
              .dropna(how='all').reset_index().set_index('Timestamp'))
data_daily.head(3)
fig, (ax1, ax2)= plt.subplots(1,2, figsize=(11.7, 4), sharey=True, sharex=True)
for period, ax in zip(['BASELINE', 'MOBISTYLE'], (ax1, ax2)):
    mask = (data_daily['Monitoring_Period'] == period)
    x, y = data_daily.loc[mask, 'Temperature'], data_daily.loc[mask, 'Outdoor_Temperature']
    z = data_daily.loc[mask, 'Window_State'] * 100
    sct = ax.scatter(x, y, c=z, cmap='GnBu')
    ax.set_title(period, size=14)
    ax.set(xlabel='Office Temperature ($^o$C)')
ax1.set_ylabel('Outdoor Temperature ($^o$C)')
# Set color bar for Window State
cbar = plt.colorbar(sct)
cbar.ax.set_ylabel('Window open (pct)', size=12, rotation=90, va='top')
cbar.ax.yaxis.set_ticks_position('right')
plt.suptitle('Window open state, Indoor and Outdoor Temperature', size=14)
plt.tight_layout()
fig.subplots_adjust(top=0.8)
fig, (ax1, ax2)= plt.subplots(1,2, figsize=(11.7, 4), sharey=True, sharex=True)
for season, ax in zip(['HEATING', 'COOLING'], (ax1, ax2)):
    mask = (data_daily['Season'] == season)
    x, y = data_daily.loc[mask, 'Temperature'], data_daily.loc[mask, 'Outdoor_Temperature']
    z = data_daily.loc[mask, 'Window_State'] * 100
    sct = ax.scatter(x, y, c=z, cmap='GnBu')
    ax.set_title(season, size=14)
    ax.set(xlabel='Office Temperature ($^o$C)')
ax1.set_ylabel('Outdoor Temperature ($^o$C)')
# Set color bar for Window State
cbar = plt.colorbar(sct)
cbar.ax.set_ylabel('Window open (pct)', size=12, rotation=90, va='top')
cbar.ax.yaxis.set_ticks_position('right')
plt.suptitle('Window open state, Indoor and Outdoor Temperature', size=14)
plt.tight_layout()
fig.subplots_adjust(top=0.8)
Office window opening time has a moderately strong correlation with the outdoor air temperature. Similarly, the office air temperature is related to the outdoor air temperature. This indicates clear seasonality of the dataset.
Grouping the data by monitoring period shows that employees kept the windows open more during BASELINE (in both seasons).
Grouping the data by season shows that employees open the windows more in the COOLING season.
I discovered that CO2 and VOC levels are strongly correlated with each other. This may indicate that the sensor measures the same air particles for both parameters.